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Abstract 

The computation of a piecewise smooth function that approximates a finite set of data 
points may be decomposed into two decoupled tasks: first, the computation of the locally 
smooth models, and hence, the segmentation of the data into classes that consist on the 
sets of points best approximated by each model, and second, the computation of the 
normalized discriminant functions for each induced class (which may be interpreted as 
relative probabilities). The approximating function may then be computed as the optimal 
estimator with respect to this measure field. 

For the first step, we propose a scheme that involves both robust regression and spatial 
localization using Gaussian windows. The discriminant functions are obtained by fitting 
Gaussian mixture models for the data distribution in the boundary of each class. We 
give an efficient procedure for effecting both computations, and for the determination of 
the optimal number of components. Examples of the application of this scheme to image 
filtering, surface reconstruction and time series prediction are presented as well. 
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1 Introduction 

This paper addresses the problem of approximating a 
function z : R d t-* Q, given a finite set of data points 
(examples) {(x,-, z,), x,- G R d , z, € Q, i = 1, . . . , N}. The 
set Q may be a finite set, as in pattern classification 
problems, or may coincide with the real numbers (as in 
function approximation, regression and filtering). 

A natural way of approaching this problem is to fit 
simple local models to the data, which are then put to- 
gether using appropriate weight functions. The fitting 
process, for each local model, involves: first, the assign- 
ment of a relative weight to each data point that indi- 
cates its importance for the determination of the model, 
and second, the computation of the parameters that min- 
imize an appropriate weighted error measure. 

These two steps may be carried out in a decoupled 
fashion, by making the weights depend only on the spa- 
tial distribution of the independent variables {x t }. Ex- 
treme instances of this strategy are the local learning 
algorithms proposed by Vapnik [1], where a different lo- 
cal model is used for each point x in which the function 
is to be evaluated, and the model parameters are made 
to depend only on those data points that are closest to 
x. Another example may be found in the use of Radial 
Basis Functions with fixed knots for function approxima- 
tion; in this case, the location of the knots (and hence, 
the relative data weights) are determined as a function of 
the distribution of {x,}, for example, using the K-Means 
algorithm [15]. 

This decoupled scheme, although computationally 
simple, does not take into consideration the structure of 
the function itself, and therefore, will lead to relatively 
complex local models, so that the approximation will be, 
in general, difficult to interpret; if the local models are 
to reflect the structure of the function, the data weights 
must depend also on the goodness of fit. 

A simple — although computationally expensive- 
way of doing this is to specify parametric models for 
both the weights and the local regressions, and then ad- 
just the parameters in a coupled way by minimizing an 
appropriate cost function. Thus, for example, in [16], the 
cost function is the log-likelihood of a statistical mix- 
ture model, which is minimized using gradient descent; 
the local models ("experts") are multi-layer perceptrons, 
and the weights ("gating networks") are linear functions 
followed by an exponential transformation and normal- 
ization ("softmax units"). The number of components 
is found by setting it to a sufficiently large number, and 
deleting, upon convergence, those components with in- 
significant weights. 

The interpretability of the approximation increases if 
the local models are given a simpler form. Also, it is 
computationally more efficient to start with a single com- 
ponent, and successively add new ones by splitting the 
domain of the one with worst local fit until the global 
fit error is below a given threshold. This is done, for 
example, in Recursive Partition Regression (RPR) [2], 
where the local models are low-order polynomials and 
the weights are step functions on each one of the inde- 
pendent variables. 

The problems with this approach are that the approx- 



imating functions are always discontinuous, and the dis- 
continuities are always piecewise parallel to the coor- 
dinate axes, so that continuous functions, or piecewise 
smooth functions with complex discontinuity shapes are 
not well approximated. This is remedied in the MARS 
approach [5], by replacing the step functions by trun- 
cated power spline basis functions, and by not remov- 
ing the parent basis functions after they are split. The 
problem here is that now the resulting approximation is 
aways globally smooth (so that discontinuous functions 
are not well approximated), and that the representation 
can no longer be expressed as a sum of spatially localized 
simple models, which has a high intuitive appeal. 

The objective of this work is to overcome some of 
these limitations; specifically, our goal is to develop a 
representation that is : easy to interpret; efficient to 
compute, and flexible, in the sense that it can generate 
both smooth and piecewise smooth approximations. It 
is based in the following ideas: 

The most flexible representation has the form of a 
measure field, which is defined by the following elements: 

i) A relatively small set of simple functions, {z k : 
R t— ► Q, k = 1, . . . , m} (the local models). 

ii) A probabilistic description of the regions where 
each local model is a good approximation of the 
function, that is, a set of functions {p k : R d 1— ► 
[0, 1], k = 1, . . . , m} such that Y2k Pfc( x ) = 10 f° r 
all x, where /0jb(x) represents the probability that 
the function is optimally approximated by z k at 
location x. 

The pairs {(p k ,Zk)} define a discrete measure field p 
that represents the function: 



Px(z) = ^2p k (x)6(z - z k (x)) 



(1) 



ik=i 



where S(-) is the usual delta function. 

One may specify different criteria for computing the 
approximation z given the measure field (1). For exam- 
ple, a smooth function may be obtained by taking the 
mean value: 



/m 
zdp x (z) = Y^Pk(x)z k (x) 
fc=i 



(2) 



and a piecewise smooth one by taking the mode: 

z(x) = z r (x) , r = argmax/>jk(x) (3) 

k 

It is also possible to include additional information, such 
as prior conditions on the shape of the smooth patches 
or on the location of the discontinuities, and compute z 
as the optimal Bayesian estimator [12]. 

This type of representation has been proposed in the 
context of machine learning [16] [8], and also for the so- 
lution of a class of computational vision problems [12]. 
It will be adequate if the models for the functions z and 
p are intuitively appealing, and allow for a computation- 
ally efficient implementation. Here, we will use for the 
local models polynomials of low order (0 or 1), so that 



piecewise constant and piece wise linear functions are ex- 
actly represented. The discriminant functions may be 
modeled as normalized sums of Gaussians: 



Pk(x) 



E£iC*i(*)' 



(4) 



where: {(?*,-(•)} are Gaussians; (3 is a positive number 
that controls the smoothness of the approximation, and 
Tik is the number of Gaussians that model the distribu- 
tion of class k. 

For the computation of the corresponding parameters, 
we propose a strategy that combines the computational 
efficiency of decoupling the weight and model computa- 
tion with the power of making the weights dependent on 
the local fit. The scheme is as follows: compute the local 
models in a first step using weights that depend on both 
x and z, that is, using local robust regression; in a sec- 
ond step, classify the data according to the model that 
best approximates the function at each point, and finally 
compute the p functions as the discriminants for this in- 
duced classification task (for classification problems, of 
course, the first 2 steps are not needed). 

The number of components is determined by a strat- 
egy similar to that of RPR; first, components are added 
one at a time, until the error variance reaches an appro- 
priate value. In a subsequent "backwards" step, redun- 
dant components are merged (in the regression phase) 
or deleted (in the classification phase). 

What makes this methodology attractive is the ex- 
istence of efficient algorithms to perform the compu- 
tations. In the next section we will show that by us- 
ing Gaussian windows, local robust regression becomes 
equivalent to the estimation of the parameters of a Gaus- 
sian mixture model (like the one used in the classifica- 
tion stage), and that this problem is equivalent to vec- 
tor quantization with a quadratic distortion measure, for 
which efficient algorithms exist. Since the discriminant 
functions in the classification phase are also modeled as 
Gaussian mixtures, a single computational framework 
may in fact be used for both steps. 

2 Gaussian Mixture Models and the 
Generalized K-Means Algorithm 

Suppose we have N data points in K d , {xj, i = 1, . . . , N}, 
and we want to approximate their distribution with n 
codewords or "centers" {m* G K d ,k = l,...,n} that 
minimize a distortion measure of the form: 

N n 

£ = Yl 5Z d ( Xi ' m *) s »* ( 5 ) 

«=1 Jfc=l 
where d{xi,m,k) is a local distortion measure, such as: 

d(i,k) = \\xi-m k \\ 2 (6) 

The indicator variables su. are given by: 

s ik - ls k (*i) 
with l s denoting the indicator function of set S, and 
where the sets Sk consist on those data points that are 
mapped into codeword k. 

It is possible to show [11] that the following (K- 
Means) algorithm will never increase £: 



1: for k = 1, . . . , n, assign to each m k ' a random lo- 
cation ; 

2: at time t, set 

8$ = l,ifrf( a r 1 -,m? ) )<rf(t,mj t >),i#fc 
= , otherwise 

(with an appropriate rule for solving the ties) and 

N n 



m[ t+1) = arg min ^ ^ d( Xi , u)s^ 
t=i Jb=i 

for example, if d(-, •) is given by (6), we get 



(0 



m 



(*+l) _ E« X i S ik 



Jb — 



(0 



Alternatively, one may consider the data as samples 
from a mixture of populations {Sk, k = 1, . . . , n}, each 
one normally distributed with mean m* and identical 
covariances 4/. 

Assuming that for each i the indicator variables for 
these populations, {s,*:}, are independently and identi- 
cally distributed according with a multinomial distribu- 
tion consisting on one equiprobable draw on n categories, 
the conditional log density for data point x» is thus: 

n 

logp(x,- | s) = -0^2s ik \\xi - mjfcH 2 + K 
k=i 

where K is a constant. 

Assuming that the data x = {x,} given s = {s,jb} are 
conditionally independent, we get that the log likelihood 
for the complete data x and s is: 

n N 

C(m) = -pJ2Y, Sik W Xi ~ m *H 2 + NK ( 8 ) 
fc=i »=i 

We may now find the maximum likelihood estimator 
for m = {mk} by treating s as missing data and applying 
the EM algorithm [4] [7]: in the expectation (E) step 
at time t, the expected likelihood < C > is found by 
replacing the variables {«,&} in (8) with their conditional 
expected value, given x and m^\ This is found using 
Bayes rule: 

E[s ik | x, m<*>] = w^ = Pi( Xi e S k | ij.mW) = 

exp[-/% t --m^|l 2 ] 

Er^expH?!!*.--^ !! 2 ] 

In the M step, we find the value of m that maximizes 
<£> : 

v^ (*) 

(* + l) • V^ (t)n ,,2 l^i w ik x i 

m k = argmm^^VHx,- - m k \\ 2 = ^ '* 

k i E,- w ik 

In the limit as f3 — ► oo, each indicator variable is replaced 
by the mode of the conditional distribution (which co- 
incides with the mean) in the E step, and so, the EM 
algorithm reduces to the K-Means scheme. For finite 



values of /3, it minimizes a distortion measure that is 
obtained from the log likelihood of the data: 

^ = E lQ g ( X>pH%< - ™*ll 2 ] ) 



1=1 



kifc=l 



In practice, the behavior of both algorithms is very sim- 
ilar, although it has been reported that the EM scheme 
tends to produce more uniform center distributions [16]. 
If one uses the EM algorithm, one can estimate, upon 
convergence, the indicator variables Si k by: 

Sik = 1 , if Wik > u>ij for all j 

= , otherwise (9) 

and thus define the domain for each center k as the set 
of points X{ for which S{ k = 1 . 

Suppose now that in addition to the data points {#,} 
we are given the corresponding values {z,} of a function 
that we want to approximate. The problem now is to 
find not only the set of centers {m k } that represents 
the data, but also to associate with each one of them a 
local model z k (x;9 k ) that represents the function z with 
minimum distortion in the domain (Voronoi polytope) of 
the corresponding center. This means that we now have 
to minimize a distortion measure of the form: 

AT n 

S = 5>g5>xp- [P [\\xi - rotll 2 + ASOr,-, *,■;**)]] 



»=i 



*=i 



(10) 
where: $ measures the fit error of the model given by 
the parameter vector 9 k (and possibly additional prior 
contraints on this vector), and A is a parameter. 

To derive the generalized K-Means (GKM) algorithm 
in this case, we consider a mixture model for the joint 
distribution of x and z and obtain the GKM scheme as 
the limit, as /? — * oo of the EM algorithm that estimates 
the vector (m, 9); the main iteration takes now the form: 

1. E Step: compute 



(0 

' J ik 






where 



Qk(xi) = exp [-flU*,- - m<°|| 2 + A<D(* t , zt; 0<°)] 
which in the GKM case reduces to: 



w 



IK 



= 1 , if &(*,-) >&(*•• W* 
= , otherwise 



2. M Step: Set 

(m? +1 \*r>) = 
= arg min ]T wty (||x-; - u|| 2 + A$(x;, z { ; v)) 



The choice of the function $ is determined by the de- 
sired behavior of the centers, and also by computational 
considerations: if it is quadratic in 9, the maximization 
steps in the above algorithms are implemented simply 
by matrix inversion. We now discuss the specific choices 
that are relevant for the determination of the local mod- 
els and for the classification stage. 



2.1 Local Robust Regression 

In this case, a natural choice for $ is the quadratic fit 
error; if the local models are linear, so that z K — a K • x + 
b k , we get: 

$(x,z;e k ) = (z-a k -x-b k ) 2 (11) 

The function $ may be modified to enforce some prior 
constraint; for example, to give higher preference to hor- 
izontal planes (i.e., for local ridge regression [22]) we may 
put: 

$(x,z;9 k ) = (z-a k ■ x - b k ) 2 + v\\a k \\ 2 

With this choice, the M step of the GKM (or EM) 
algorithms is equivalent to the solution of a decoupled 
set of linear systems (one for each local model) whose size 
grows linearly with the dimension of the input space. 

The solutions found by these algorithms may be un- 
derstood, in terms of robust regression, as the best n 
hyperplanes that pass through the data, with the con- 
straint (enforced by the \\x — m k \\ 2 terms) that the sup- 
port for each hyperplane must be spatially localized; this 
constraint is crucial to get good results, since it prevents 
the system from finding suboptimal solutions with non- 
local support. The relative weight of this constraint is 
controlled by the parameter A; if its value is too high, 
the non-local solutions will prevail, while if it is too low, 
the location of the centers will be controlled solely by the 
data density. The solution, however, is not very sensitive 
to the precise value of this parameter; if one scales the 
data so that all X{ are inside the unit hyper cube, and all 
Zi are in the interval [0,1], a value of A between 1 and 5 
gives good results. 

The nature of the solution also depends on the num- 
ber of local models: there should be enough of them for 
the system to escape local minima that correspond to 
non-locally supported hyperplanes. On the other hand, 
if there are too many, one would be, in effect, modeling 
the noise (i.e., overfitting the data) with the correspond- 
ing degradation of the generalization capabilities of the 
system. 

This trade-off between model capacity and generaliza- 
tion error has been widely recognized, and criteria dif- 
ferent from the minimization of the empirical risk have 
been proposed to find an optimal compromise (e.g., min- 
imization of the structural risk [1]; minimum description 
length criteria [20] or cross validation techniques [3]). 

In this case, due to the simplicity of the local models, 
one may use directly as a criterion an approximation to 
the expected value of the fit error. In [6] it is shown that 
this expected value may be decomposed into a bias and 
a variance terms: 

E[(z(x)-z(x)) 2 } = (Ez(x)-z(x)) 2 + E[(z(x)-Ez(x)) 2 ] 
If z() is piecewise linear, and there are enough lo- 
cal models, the corresponding piecewise linear regression 
z(x) is unbiased, so that the mean fit error is controled 
by the variance. For each local model, the variance of 
the predicted mean response z k (p), for any point p inside 
its Voronoi polytope S k , may be estimated as [21]: 

V k (p) = Var(%(p)) = a 2 {±-+{p-x k ) T {XlX k )-\p-x k )) 

"k 

(12) 



where n k is the number of examples inside Sk ; 



°l 



Xk 



1 



= — V] Xi > 



»:»i€5fc 



"ib 



J2 (zi-h(xi))' 



D-l . „ 

where D is the dimension of the input space and the 
(njfe x D) matrix Xk whose rows are (#, — Xk) T for each 
Xi e S k . 

Equation (12) allows one to estimate the reliability of 
the approximation at any point; therefore, one may use 
it to estimate, in addition to the reconstructed surface, 
a corresponding "error map" . 

The optimal number of models is determined by mini- 
mizing the average variance. This may be approximated 
by the mean variance at the centers {xk} of the Voronoi 
polytopes: averaging Vk(xk) over all K centers, and not- 
ing that nj. « N/K , we get: 



V K « K(t 2 /N 



where 



= ^E E> -*(*<»' 



is the empirical risk. 

The complete strategy, therefore, consists in adding 
one model at a time (e.g., by splitting the center with 
worst local fit error) until a maximum number K max is 
reached, and then pick the solution that corresponds to 
the number of centers K that minimizes Vk ■ The num- 
ber K max is related to the minimum number of examples 
that reliably support a hyperplane, and may depend on: 
the dimension of the input space; the total number of ex- 
amples and the noise power. Although it is convenient 
to choose it in a reasonable way, its precise value is not 
critical for the performance of the system. 

Since the models are locally supported, it is possible 
that several of them correspond in fact to the same — or 
very similar — hyperplanes. For this reason, it is nec- 
essary, once a solution is found, to merge together the 
redundant components. To do this, one should find an 
appropriate way of measuring how different two hyper- 
planes are. It may be shown [19] that given the usual 
normality assumptions, the relative MSE reduction ob- 
tained by replacing the two hyperplanes by a single one 
(with the appropriate corrections to compensate for the 
different number of parameters) has a standard F dis- 
tribution and hence, it may be used for testing the cor- 
responding hypothesis. In particular, for each pair of 
models (j, k) one computes the statistic: 



Fik = 



(SS jk -SSj-SS k )/{D+l) 



jk (55,- + 55 t )/(»»,■ + n 4 -2(D + l)) 
where D is the dimensionality of the input space; 

SSj= J2 (*i-2j{*i)? 

i.x,£Sj 

SS jk = Y2 ( Z i ~ Zjk(Xi)) 2 

i'.Xi&SjUS), 



where Zj k is the best (LSE) hyperplane through the 
points in Sj U Sk- 

Each merging step consists in finding the models (j, k) 
with the smallest value of Fjk] this statistic is then com- 
pared with the tabulated value of F with (D + 1) and 
(rij + njb — 2(.D+1)) degrees of freedom at the appropriate 
confidence level; if it is smaller, the 2 models are replaced 
by Zjk and the whole process is repeated until the null 
hypothesis (equality of the regression hyperplanes) is re- 
jected for the smallest F. 

This test may fail if the normality assumption is 
grossly violated (although we have found that it is rel- 
atively robust). In these cases it may be convenient to 
also put an upper bound on the number of linear compo- 
nents that one wants to include in the complete model, 
so that the merging process stops only if the smallest F 
is too large and the total number of components is below 
this bound. 

After the merging process is completed, one associates 
with each data point Xi a class c, that corresponds to the 
model that contains the center k for which the indica- 
tor variable Sik = 1. To compute the desired measure 
field representation, one now has to find the discriminant 
functions for the data {(x,-, c,), i = 1 . . . N}. 

2.2 Local Gaussian Classifiers 

Consider the problem of finding discriminant functions 
that optimally classify the data: {(x,-,Ci),f = 1...N}, 
c» G {1, . . • , Q} for all i. A simple way of doing this is 
to approximate the data distribution within each class 
c with a sum of n c Gaussians whose centers are found 
again with the GKM algorithm [13]. This may be ob- 
tained from the above model by setting A to a very large 
value, and $ to: 

®(x, z;m kc ) = 1 - S(z(x) - c) 

where mk c is the center of the k th Gaussian that approx- 
imates the data distribution of class c. 

One may now define a 1-parameter family of discrim- 
inant functions (indexed by the positive parameter (3) 



p c (x) = 



T,%,Tr r q =,G rq {xy 



(13) 



where 

G kc (x) =\ A kc T 1/2 exp[--(x - m kc ) T A^(x - m kc )] 

The matrices A kc (| A kc | denotes their determinants) 
may all be taken equal to the identity (in which case this 
procedure becomes similar to the LVQ1 method [9]), or 
they may be computed as the covariance matrices inside 
the corresponding Voronoi polytope (see [13]): 



Akc - 



l_ 

"Jbc 



^2 ( x i ~ m kc){Xt ~ mkcf 



Xi€S k < 



(this is usually important in high dimensions with 
anisotropic data). 

For high values of /?, the classifier simply assigns to 
each point the class of the closest center (with a distance 



weighted by A matrices). As (3 decreases, the discrimi- 
nant functions — and hence the inter-class boundaries- 
become smoother, so that (3 controls, in some sense, the 
trade-off between capacity and generalization capabili- 
ties. 

If the classifier is trained in such a way that it achieves 
(near) perfect classification of the data with high /?, this 
parameter may then be used to fine-tune the classifier 
(using, for example cross-validation) to attain low gen- 
eralization errors. 

Training the classifier means finding the minimal 
number of centers that achieve a given classification er- 
ror. For this, one may use the following procedure: start 
with one center per class; upon convergence of the K- 
Means procedure, compute the approximate classifica- 
tion; add a new center at a location that corresponds 
to a misclassified data point in the class with highest 
classification error, and repeat the K-Means process. 

Once the target approximation accuracy (or the maxi- 
mum number of allowable centers) is reached, one should 
"prune" the irrelevant centers (those sample the interior 
of a class, and therefore do not contribute to the cor- 
rect classification of any point). More precisely, a center 
mjfec is pruned if its irrelevance function I(k,c) equals 1. 
I(k,c) is defined as: 

I(k,c) = si Y, (l-8(c-z k cM)\ 

\i:z(x,)=c / 

where z(x{) is the current estimate for the class of Xi, 
and Zkc(xi) is the class that would be estimated if rrikc 
was deleted. Note that for high /? one needs to consider 
in the sum only those points in the Voronoi polytope of 
mjfcc, and Zkc(xi) is simply the class associated with the 
second nearest center to x-, . 

In practice, it is possible to prune the irrelevant cen- 
ters dynamically by computing I(k,c) for all centers at 
each iteration; if a center is found to be irrelevant, it is 
pruned and all the data points inside its Voronoi poly- 
tope are marked as inactive. In this way, one works with 
less and less data points, accelerating the whole process 
(provided that at each iteration one checks the inactive 
points and includes again those that are misclassified). 
We have found that in general this procedure gives the 
same results as the one described above, except when 
there is a very large inter-class overlap (classification 
noise), when the dynamic pruning may fall into a limit 
cycle. 

An interesting by-product of this pruning procedure 
is a secondary class that may be associated with each 
data point, namely, an indicator of its relevance or ir- 
relevance for the primary classification task. One may 
then construct a binary classifier for this secondary class 
which may be useful for guiding the recollection of new 
data, particularly when this involves complex and ex- 
pensive procedures. 

3 Examples 

In this section we present some examples that illustrate 
the performance of these techniques. 



The first set of examples deal with 2-dimensional 
problems. The first one shows that the system cor- 
rectly identifies the piecewise linear structure when it 
is present: panel (a) of figure 1 shows a function z(x, y) 
that is equal to a tilted plane inside an L-shaped re- 
gion and is constant elsewhere. The input to the sys- 
tem consisted on 200 triplets {(x t -,t/,-, z,)}, where each 
(xi,yi) is a randomly selected point on the square and 
Zi = z(xi,yi) + n,, where {n,} is a set of independent, 
identically distributed Gaussian random variables with 
zero mean and variance equal to 0.05. In step 1, the algo- 
rithm found 5 local models, which after the merging step 
were reduced to 2. The second step was thus a binary 
classification problem for which perfect classification was 
achieved with 26 Gaussians (with covariance matrices 
equal to the identity), which after pruning were reduced 
to 17. The reconstructed surface, obtained as the mode 
of the measure field with (3 = 8, is shown in panel (b). 
Note that additional information about the inter-class 
boundary may be easily incorporated, improving the so- 
lution without having to recompute the measure field; for 
example, if one wishes to reconstruct the surface in the 
nodes of a square lattice L, one may include a smooth- 
ness constraint on these boundaries in the form of a prior 
Markov Random Field (MRF) model for the shape of the 
smooth patches (see [12]). Using, for instance, a first 
order MRF model with Ising potentials, one may obtain 
the optimal classification {c,, i E L} as the MPM estima- 
tor [14] with respect to the posterior Gibbs distribution 
with energy: 

^) = E^' a i)-E^( a?/ (*')) 



i»jj 



iEL 



with 



V(a,b) = - T , if a = b 
= 7 , otherwise 

where: [i,j] denotes summation over all nearest neighbor 
pairs of sites in L; 7 is a parameter and xl(i) is the 
coordinate vector of node i. 

Figure 1-c shows the top view of the "true" classi- 
fication; panels (d) and (e) show the maximum likeli- 
hood classifications (obtained by selecting the class with 
largest p c at each node) for /? = 8 and /? = 2, respec- 
tively. The optimal Bayesian (MPM) classification (with 
7 = 10) is shown in panel (f). Note that in all cases all 
the examples are correctly classified; the generalization 
performance, however, will clearly be superior in the last 
case. 

Figure 1 around here 

The second example illustrates the system perfor- 
mance when one of the surface patches is non-linear. 
For this, we use the function studied in [10], which is 
illustratred in panel (a) of figure 2 on the unit square. 
Again, the input to the system consisted in 200 triplets 
{(#», Vi, *»)} with (x{, y.) selected randomly. The system 
found 6 components in step 1 (after merging), and 21 
Gaussians (with covariance matrices equal to the iden- 
tity) in the 6-way classifier of step 2. The reconstructed 
surface is now computed as the mean of the measure field 



(to get a smooth reconstruction). It is shown in panels 
(b) and (c) of figure 2 for /? = 10 and (3 = 2, respectively. 

Figure 2 around here 

3.1 Image Filtering and Segmentation 

An important class of problems arises when one has 
dense, noisy data on a regular lattice (i.e., an image), so 
that no interpolation is needed. In this case, the desired 
result is just the output of step 1. This output may be 
interpreted both as a filtered image (since the fitted lo- 
cal models smooth out the noise) and as a segmentation, 
where each segment corresponds to the domain of each 
local model. Since there is an observation 2, at every 
point where one wants to compute the approximation, 
one may also construct a measure field representation 
by putting: 



time series created by the Mackey-Glass delay-difference 
equation: 



Pk(xi) 



exp [-/3[\\ Xi - m k f + \( Zi - z k (xj)) 2 }} 
JV exp [-/?[||,;, - mi |P + X( Zi - zjixt))*]] 



for each center k (note that after merging one may have 
Zk(x) = Zj(x) if centers j and k belong to the same 
component). By setting the filtered image to the mean 
value with respect to this measure field, one may use 
P to control the amount of inter-component smoothing: 
for large f3 we get an edge-preserving filter and for small 
P a global smoother. The performance of this scheme 
for a piecewise linear, noisy synthetic image is shown in 
figure 3. Note the correct identification of the 3 linear 
components (after merging), and hence, the correct seg- 
mentation, in spite of the fact that the intensity gradient 
between linear regions disappears in some parts of the 
image. The shape of the inter-class boundaries may be 
made smoother (and the isolated pixels eliminated) by 
using a Gibbsian prior and computing the MPM estima- 
tor as above. 

Figure 3 around here 

A similar idea may be used for image quantization; 
suppose one wishes to represent an image with a fixed, 
small number of grey levels. In this case, the local mod- 
els are 0-order polynomials (constants) and the merging 
phase is stopped when the desired number of components 
is reached. The result is illustrated in figure 4, where a 
detail of a real, noisy image is quantized to 5 levels (a 
uniform quantization is also included for comparison). 

Figure 4 around here 

3.2 Time Series Prediction 

This example illustrates the system performance in more 
than 2 dimensions and in a situation where the basic 
model assumptions (piecewise linear structure of the true 
function and additive Gaussian noise) are grossly vio- 
lated; as we show, the performance is still reasonable in 
this case. 

The task is to predict the value of a time series at 
some future time, based in 4 previous values. To be able 
to make comparisons with other approximation meth- 
ods, we use the same example as in [17]: the chaotic 



t/(<+l) = (l-6)y(*)+a 



y(t ~ r) 



l + y(t- r) 



10 



with a = 0.2, b = 0.1 and r = 17. 

Each 4-dimensional example point x, is formed by the 
values of the series at times: T,- —6,7} — 12, 7} — 12 and 
Ti — 18. The corresponding z,- is the value at 7} 4- 85. 
Using a training set of 1000 examples the local robust 
regression finds 26 components, which after merging are 
reduced to 21. For the classification phase, a simple 
Gaussian classifier (i.e., with one center per class and 
P = 1) was used (giving 3.8 % classification error). The 
final normalized rms error is 0.12, which is comparable 
to the accuracy of a standard Radial Basis Function net- 
work [18] performing the same task (with an rms error 
ofO. 1), as reported in [17]. 

The most accurate algorithms reported there: a 
multi-layer perceptron trained with back propagation 
and a particular Resource-allocating network, achieved 
a normalized rms of 0.03). It is worth noting, how- 
ever, that the computational load of our method is very 
light, and the total number of parameters, which is: 
21 x (4 4- 2 -|- 10) = 336, makes this representation more 
compact than all others reported in [17] (all use more 
than 500 parameters). 

4 Summary and Discussion 

We have presented a strategy for approximating a func- 
tion, given a set of examples, in which an intermedi- 
ate representation — a measure field — is computed in 
a first stage. This representation, which consists in a 
set of simple local models with associated probabilities, 
has the advantage of allowing for globally or piecewise 
smooth reconstructions without any additional compu- 
tational effort; also, it permits the incorporation of ad- 
ditional information about the reconstructed function 
(e.g., about the location or shape of the discontinuities, 
etc.) in an easy way. 

For the computation of this representation, we pro- 
posed a procedure in which the local models are found 
in a first step, and induce a segmentation of the data (in 
terms of which local model is supported by each data 
point). The associated probabilities are then computed 
in a second step as the discriminant functions of this 
induced classification task. 

This procedure has the advantage of combining the 
computational simplicity of decoupled methods for find- 
ing the domain of each local model, with the power of 
making the local model selection dependent on the lo- 
cal fit, which permits the use of simpler models that are 
easy to interpret. What makes it specially attractive is 
the existence of efficient algorithms for performing these 
computations; these algorithms are generalizations of the 
K-Means procedure for vector quantization, with the ad- 
dition of terms that measure the goodness of fit. Here, 
they were derived from the EM algorithm for computing 
the parameters of a Gaussian mixture model. 

An important problem that this procedure has in com- 
mon with all methods that approximate a function as 



a combination of local models is the trade-off between 
overall capacity and generalization capabilities. Here, 
we proposed a simple statistical criterion, namely, the 
average variance of the mean predicted response of each 
model, to find the optimal number of models. 

It is also important to eliminate redundancies after 
each step has converged; these redundancies are caused, 
in the first step, by the fact that the local models are 
spatially localized (so that a hyperplane that extends 
over a large region of the input space will be broken into 
several components to find a good approximation to its 
boundary). In the second step, there are Gaussians that 
approximate the data distribution away from the inter- 
class boundaries and are thus irrelevant for the classifi- 
cation task. In the proposed scheme, these redundancies 
are eliminated by merging similar hyperplanes (using a 
criterion based on the F statistic) after the first step, 
and by pruning irrelevant components after the second; 
this last method has the advantage of producing a model 
for the relevance of the examples, which may be used to 
guide the gathering of additional data. 

The results of the experiments we have performed, in- 
dicate the plausibility of this approach for general func- 
tion approximation tasks, and specially for surface re- 
construction problems like those that occur in image pro- 
cessing and computational vision. 
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FIGURE CAPTIONS 

Figure 1: (a) Original surface, (b) Reconstructed surface from 200 exam- 
ples, (c) True classification (d) Maximum likelihood classification ((3 = 8). 
(e) Maximum likelihood classification ((3 = 2). (f) Bayesian (MPM) classifi- 
cation. 



Figure 2: (a) Original surface, (b) Reconstructed surface from 200 examples 
(/? = 10). (c) Same as (b) with /? = 2. 



Figure 3: (a) Original noisy image, (b) Filtered image, (c) Segmentation 
(each grey level corresponds to a linear component). 



Figure 4: (a) Original image, (b) Optimal quantization, (c) Uniform quan- 
tization. 
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